Abstract 

The Zeitgeist for reform in education precipitated a number of changes in 
assessment. Among these are performance assessments, sometimes linked to 
“high stakes” accountability decisions. In some instances the trustworthiness 
of these decisions is based upon variance components and error variances 
derived through generalizability theory. Often overlooked is the fact that 
these statistics are subject to sampling error. This paper introduces 
techniques used to determine the accuracy of such statistics. 



Introduction 

The climate today in educational assessment is vastly different from what it was twenty 
years ago. Where standardized achievement tests once dominated the landscape, numerous 
competitor assessments now regularly appear, assessments that incorporate previously 
unheard-of formats. The nontraditional nature of these assessments is an outgrowth of the 
search for tests that provide a measure of validity that typical multiple-choice tests can’t. 
Validity, however, is only a partial measure of the soundness of any measurement instrument. 
If a test’s results aren’t reliable, its superior validity is of little consequence. Older 
notions of reliability, like antiquated notions of validity, don’t possess the sophistication 
necessary to determine the accuracy of such instruments. Recent initiatives linking test 
results to rewards and penalties at the state, local, and individual level are not uncommon. 
Providing a measure of accuracy for such results is critical. The statistical techniques 
developed to address these issues are typically referred to under the comprehensive heading 
generalizability theory. 

Introduced nearly 30 years ago, generalizability theory is today one of the major 
statistical techniques used in assessment. Its adaptability makes it the model of choice in 
a number of assessment situations. It is particularly sound for performance assessments, 
where IRT-based procedures are inappropriate due to the small number of items. Depending 
upon the desired use of the assessment (e.g., to measure achievement at the individual 
level, the school level, or perhaps the district level), generalizability theory allows for 
the computation of multifaceted error estimates that provide the most complete measure of 
reliability available from any procedure today. This subdivision of error is particularly 
valuable in pinpointing exactly where to improve an assessment. 
Beginning with a model that includes all relevant facets and their interactions, the 
initial G study provides variance component estimates for both the main effects and the 
interactions associated with the model.¹ Using these variance components, the D study 
provides variance components associated with the means for sets of sampled conditions. 
These in turn yield the absolute and relative error variances, $\sigma^2(\Delta)$ and 
$\sigma^2(\delta)$, respectively. The variance components and error variances provide both 
specific and overall information about how accurate the results from a test are and where 
sources of inaccuracy arise. More specifically, variance components provide a measure of 
the variability for each effect were it possible to collect numerous scores for the same 
student or aggregate of interest over all conditions in the universe of admissible 
observations. Similarly, $\sigma^2(\Delta)$ and $\sigma^2(\delta)$ provide a composite 
measure of variability for the same hypothetical collection of scores. 
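
As a minimal sketch of the D study step for the simplest case, the single-facet p x i design, the two error variances can be computed from G study variance components with the standard generalizability theory formulas. The function name and the numerical inputs below are illustrative assumptions, not values from this paper.

```python
def d_study_error_variances(var_i, var_pi, n_items_prime):
    """Absolute and relative error variances for a random p x i design.

    var_i, var_pi  : G study variance components for items and the
                     person-by-item interaction (hypothetical inputs).
    n_items_prime  : number of items sampled in the D study.
    """
    # Relative error involves only the components that interact with persons.
    relative = var_pi / n_items_prime
    # Absolute error adds the item main effect.
    absolute = (var_i + var_pi) / n_items_prime
    return absolute, relative


# Illustrative values only (assumed, not taken from the paper).
abs_err, rel_err = d_study_error_variances(var_i=0.5, var_pi=0.2, n_items_prime=5)
print(abs_err, rel_err)
```

Increasing `n_items_prime` shrinks both error variances, which is how a D study explores alternative measurement designs.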

Any prudent user of a technique like generalizability theory must recognize that 
variance components and error variances are statistics and, as such, are subject to sampling 
error. The variance of variance components and error variances represents the fluctuation 
one might expect in those statistics were it possible to perform multiple G studies using 
students, schools, tasks, and raters from the same universe of admissible observations. 
Whereas the estimated variance components and error variances gauge the accuracy of the 
instrument with respect to a single administration, the standard error associated with 
variance components and error variances provides a measure of fidelity for the instrument 
across multiple administrations. Particularly with respect to decisions involving 
significant consequences, the extent of sampling error must be determined and accounted for. 
Determining the amount of sampling error is crucial to placing any faith in the use of a 
statistic and, in turn, any decision based upon that statistic (Cronbach, Linn, Brennan, & 
Haertel, 1995). 

Brennan (1992) cites two methods for estimating standard errors of variance 
components. Unfortunately, because the exact distributions of variance components and error 
variances are very complex, even unbiased estimators of the variance of variance components 
become difficult to interpret. A more reliable method that bypasses this distributional 
difficulty is to compute confidence intervals. Previously, researchers relied on two 
procedures, the Satterthwaite and the Welch, to derive confidence intervals for variance 
components.² Though appropriate in a number of situations (Smith, 1936; Satterthwaite, 
1941, 1946; Welch, 1956), these two procedures have a number of inadequacies that make them 
less than ideal for use with generalizability theory applications. 

The Satterthwaite procedure assumes large values of degrees of freedom for each 
source of variance and, in addition, assumes the difference in degrees of freedom across the 
sources of variation to be small. Both the Welch and Satterthwaite procedures assume that 
the estimated variance component about which the confidence interval is constructed be 
a linear combination of expected mean squares with only positive coefficients (Burdick & 
Graybill, 1992, pp. 30-31). That is, if 

$$\sigma^2(\gamma) = \sum_{k=1}^{n} c_k E(S_k^2), \qquad c_k > 0, \tag{1}$$

then the Welch and Satterthwaite procedures are appropriate for confidence interval 
estimation. This last restriction proves fatal to any intended application of these 
two procedures in generalizability theory. 

¹ Point estimates of variance components referred to in this paper are computed using linear combinations 
of mean squares. The results presented in this paper apply only to variance component estimates of this 
type. 

² More specifically, the confidence intervals are for linear combinations of expected mean squares with 
positive scalar coefficients. 

As an example, consider the simple one-facet $p \times i$ design. If $n_i$ and $n_p$ denote 
the sample sizes associated with each facet, then the following equations provide the three 
variance components as functions of expected mean squares: 

$$\sigma^2(p) = [E(S_p^2) - E(S_{pi}^2)]/n_i$$

$$\sigma^2(i) = [E(S_i^2) - E(S_{pi}^2)]/n_p$$

$$\sigma^2(pi) = E(S_{pi}^2)$$

Clearly, the scalar coefficients of the expected mean squares are not all positive, invalidating 
the use of both the Welch and Satterthwaite procedures. Worse yet, in typical 
implementations of this design, where $p$ and $i$ are main effects associated with persons 
and items, respectively, it is not at all uncommon for $n_p$ to be much larger than $n_i$, 
undermining one of the Satterthwaite assumptions. 
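
The three relations for the p x i design can be sketched in code by replacing each expected mean square with its observed mean square. This is an illustrative sketch only: the function name and the tiny data set are assumptions, and the point is simply that two of the three estimates subtract one mean square from another.

```python
def pxi_variance_components(scores):
    """Estimate variance components for a crossed p x i design.

    scores[p][i] holds one score per person-item cell (balanced data).
    """
    n_p, n_i = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_i)
    person_means = [sum(row) / n_i for row in scores]
    item_means = [sum(row[i] for row in scores) / n_p for i in range(n_i)]

    # Sums of squares for persons, items, and the residual interaction.
    ss_p = n_i * sum((m - grand) ** 2 for m in person_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in item_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_pi = ss_total - ss_p - ss_i

    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_pi = ss_pi / ((n_p - 1) * (n_i - 1))

    # Note the negative coefficient on ms_pi in the first two estimates.
    return {
        "p": (ms_p - ms_pi) / n_i,
        "i": (ms_i - ms_pi) / n_p,
        "pi": ms_pi,
    }


# Hypothetical 2 persons x 2 items data, chosen so the arithmetic is easy.
print(pxi_variance_components([[1.0, 2.0], [3.0, 4.0]]))
```

Because of the subtraction, an estimate can be negative in a given sample, which is one practical symptom of the mixed-sign coefficients discussed above.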

This simple design illustrates the need for alternate confidence interval estimation 
procedures for variance components and error variances in generalizability theory. The 
purpose of this article is to present new results that extend confidence intervals for 
variance components to the statistics encountered using fully random balanced designs in 
generalizability theory. These results are tested for accuracy across numerous simple and 
complex designs found in generalizability theory applications. In addition to appearing to 
overcome many of the shortcomings of the Welch and Satterthwaite procedures with respect to 
variance components, this research yields accurate confidence intervals about the most 
important statistic computed in generalizability theory, $\sigma^2(\Delta)$. 

Details of recent research 

The variance components and error variances encountered using various designs in 
generalizability theory require alternate confidence interval construction procedures, 
namely procedures that allow coefficients of both signs in Equation 1 and highly variable 
degrees of freedom. With respect to variance components, even the simplest designs produce 
variance components that are linear combinations of expected mean squares with both positive 
and negative coefficients. With respect to error variances, linear combinations of expected 
mean squares with both positive and negative coefficients and with only positive 
coefficients are not uncommon. Two recently derived procedures accommodate these scenarios 
and, in general, provide superior confidence coefficients to those produced using the 
Satterthwaite and Welch procedures. 

Before presenting the specific results, a brief review of confidence intervals is 
warranted. Let $\alpha$ designate a prescribed significance level; then $1 - 2\alpha$ denotes the 



CONFIDENCE INTERVALS IN GENERALIZABILITY THEORY 






confidence coefficient of the two-sided interval $\{\sigma^2(\gamma) : L \le \sigma^2(\gamma) \le U\}$ 
and $1 - \alpha$ denotes the confidence coefficient of the two one-sided intervals 
$\{\sigma^2(\gamma) : L \le \sigma^2(\gamma) < \infty\}$ and 
$\{\sigma^2(\gamma) : 0 \le \sigma^2(\gamma) \le U\}$. More formally, 
$1 - 2\alpha = P[L \le \sigma^2(\gamma) \le U]$ and 
$1 - \alpha = P[0 \le \sigma^2(\gamma) \le U] = P[L \le \sigma^2(\gamma) < \infty]$. 
In relatively few cases do the preceding equalities hold. When they do, such intervals are 
called exact. In cases when equality fails to hold, intervals are called approximate. 
Moreover, if the probability exceeds the designated confidence coefficient, then the 
approximate interval is called conservative; otherwise the approximate interval is called 
liberal. Three types of approximate intervals are hereafter considered: upper intervals of 
the form $[L, \infty)$, lower intervals of the form $[0, U]$, and two-sided intervals of the 
form $[L, U]$. 

Graybill and Wang (1980) and Ting, Burdick, Graybill and Gui (1989) began by 
considering confidence intervals on positive sums of expected mean squares. Building on 
these results, Lu, Graybill and Burdick (1988) and Ting, Burdick, Graybill, Jeyaratnam, 
and Lu (1990) developed confidence interval estimation procedures on linear combinations 
of expected mean squares with both signs. With respect to the former, the researchers 
employed a modified large-sample procedure similar to that suggested by Welch (1956). 
Let $\sigma^2(\gamma)$ be defined as in Equation 1 and let $\hat\sigma^2(\gamma)$ be defined 
as follows: 

$$\hat\sigma^2(\gamma) = \sum_{k=1}^{n} c_k S_k^2, \qquad c_k > 0.$$

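Graybill and Wang defined the confidence interval containing $\sigma^2(\gamma)$ as follows: 

$$\hat\sigma^2(\gamma) - \sqrt{\sum_{k=1}^{n} G_k^2 c_k^2 S_k^4} \;\le\; \sigma^2(\gamma) \;\le\; \hat\sigma^2(\gamma) + \sqrt{\sum_{k=1}^{n} H_k^2 c_k^2 S_k^4}, \tag{2}$$

where 

$$G_k = 1 - \frac{1}{F_{\alpha:n_k,\infty}} \quad \text{and} \quad H_k = \frac{1}{F_{1-\alpha:n_k,\infty}} - 1.$$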
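
The interval in Expression 2 can be sketched in a few lines of code. The sketch below rests on two assumptions that are not stated in this section: $n_k$ is read as the degrees of freedom of $S_k^2$, and $F_{\alpha:n_k,\infty}$ is read as the upper-$\alpha$ point of the F distribution, which equals the upper-$\alpha$ chi-square point divided by $n_k$ (the convention in Burdick & Graybill, 1992). The helper names are hypothetical, and the chi-square quantile is found by simple bisection so that the example needs only the standard library.

```python
import math


def chi2_cdf(x, df):
    """Chi-square CDF via the series for the regularized incomplete gamma."""
    if x <= 0.0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= z / (a + n)
        total += term
        n += 1
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))


def chi2_quantile(p, df):
    """Value x with P(chi-square_df <= x) = p, found by bisection."""
    lo, hi = 0.0, df + 10.0 * math.sqrt(2.0 * df) + 50.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def graybill_wang_interval(coefs, mean_squares, dfs, alpha=0.05):
    """Return (L, estimate, U) for sum_k c_k E(S_k^2) with all c_k > 0."""
    estimate = sum(c * s for c, s in zip(coefs, mean_squares))
    lower_sq = upper_sq = 0.0
    for c, s, n in zip(coefs, mean_squares, dfs):
        # G_k = 1 - 1/F_{alpha:n,inf}; H_k = 1/F_{1-alpha:n,inf} - 1,
        # with F_{alpha:n,inf} taken as an upper-tail point (assumption).
        g = 1.0 - n / chi2_quantile(1.0 - alpha, n)
        h = n / chi2_quantile(alpha, n) - 1.0
        lower_sq += (g * c * s) ** 2
        upper_sq += (h * c * s) ** 2
    return (estimate - math.sqrt(lower_sq), estimate,
            estimate + math.sqrt(upper_sq))


# Hypothetical mean squares and degrees of freedom, for illustration only.
print(graybill_wang_interval([0.5, 0.5], [4.0, 1.0], [10, 20]))
```

With a single mean square the bounds reduce to the familiar exact chi-square limits, which is a convenient sanity check on the reading of $G_k$ and $H_k$ used here.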

Using two methods, the authors tested the accuracy of these intervals against those 
available from the Satterthwaite and Welch procedures: (a) when $k = 2$ (two mean squares) 
the authors employed numerical integration via an elegant result due to Fleiss (1971), and 
(b) when $k > 2$, the authors conducted simulation studies based upon 10,000 replications. 
In their study of two-sided intervals containing $\sigma^2(\gamma)$, Graybill and Wang 
demonstrated that their procedure was superior to those provided by Satterthwaite and Welch 
in cases where $k = 2$. Specifically, their confidence coefficients held at the stated 
$\alpha$-level or were conservative across varying degrees of freedom. Later tests with 
$k > 2$ on one-sided intervals gave less conclusive results (Ting et al., 1989). On lower 
intervals the Graybill-Wang interval was superior to its Satterthwaite and Welch 
counterparts. In contrast, the Graybill-Wang interval was somewhat liberal with respect to 
upper intervals. 

In cases where the variance component is a sum of expected mean squares with 
coefficients of both signs, analogous equations result from the modified large-sample 
approach just considered. Consider $\sigma^2(\gamma)$ defined by the following equation: 

$$\sigma^2(\gamma) = \sum_{q=1}^{j} c_q E(S_q^2) - \sum_{r=j+1}^{k} c_r E(S_r^2), \tag{3}$$

where $c_i$, $1 \le i \le k$, are positive. If $\hat\sigma^2(\gamma)$ is defined by 

$$\hat\sigma^2(\gamma) = \sum_{q=1}^{j} c_q S_q^2 - \sum_{r=j+1}^{k} c_r S_r^2,$$

where $S_i^2$ represents the mean square associated with effect $i$, then the upper bound for a 
lower $1 - \alpha$ confidence interval is given by (Burdick & Graybill, 1992) 



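$$U = \hat\sigma^2(\gamma) + \sqrt{V_U}, \tag{4}$$

where 

$$V_U = \sum_{q=1}^{j} H_q^2 c_q^2 S_q^4 + \sum_{r=j+1}^{k} G_r^2 c_r^2 S_r^4 + \sum_{q=1}^{j} \sum_{r=j+1}^{k} H_{qr} c_q c_r S_q^2 S_r^2 + \sum_{r=j+1}^{k-1} \sum_{t>r} K_{rt} c_r c_t S_r^2 S_t^2,$$

$$H_q = \frac{1}{F_{1-\alpha:n_q,\infty}} - 1, \qquad q = 1, \ldots, j,$$

$$G_r = 1 - \frac{1}{F_{\alpha:n_r,\infty}}, \qquad r = j+1, \ldots, k,$$

$$H_{qr} = \frac{(1 - F_{1-\alpha:n_q,n_r})^2 - H_q^2 F_{1-\alpha:n_q,n_r}^2 - G_r^2}{F_{1-\alpha:n_q,n_r}},$$

and 

$$K_{rt} = \left[ \left(1 - \frac{1}{F_{\alpha:n_r+n_t,\infty}}\right)^2 \frac{(n_r + n_t)^2}{n_r n_t} - \frac{G_r^2 n_r}{n_t} - \frac{G_t^2 n_t}{n_r} \right] \frac{1}{k-j-1}, \qquad t = r+1, \ldots, k.$$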



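A similarly formidable set of equations gives the lower bound for an upper $1 - \alpha$ 
confidence interval: 

$$L = \hat\sigma^2(\gamma) - \sqrt{V_L}, \tag{5}$$

where 

$$V_L = \sum_{q=1}^{j} G_q^2 c_q^2 S_q^4 + \sum_{r=j+1}^{k} H_r^2 c_r^2 S_r^4 + \sum_{q=1}^{j} \sum_{r=j+1}^{k} G_{qr} c_q c_r S_q^2 S_r^2 + \sum_{q=1}^{j-1} \sum_{u>q} G_{qu} c_q c_u S_q^2 S_u^2,$$

$$G_q = 1 - \frac{1}{F_{\alpha:n_q,\infty}}, \qquad q = 1, \ldots, j,$$

$$H_r = \frac{1}{F_{1-\alpha:n_r,\infty}} - 1, \qquad r = j+1, \ldots, k,$$

$$G_{qr} = \frac{(F_{\alpha:n_q,n_r} - 1)^2 - G_q^2 F_{\alpha:n_q,n_r}^2 - H_r^2}{F_{\alpha:n_q,n_r}},$$

and 

$$G_{qu} = \left[ \left(1 - \frac{1}{F_{\alpha:n_q+n_u,\infty}}\right)^2 \frac{(n_q + n_u)^2}{n_q n_u} - \frac{G_q^2 n_q}{n_u} - \frac{G_u^2 n_u}{n_q} \right] \frac{1}{j-1}, \qquad u = q+1, \ldots, j.$$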



In a fashion similar to Graybill and Wang (1980) and Ting et al. (1989), Ting et 
al. (1990) applied two methods to test the accuracy of the prescribed intervals, depending 
upon the number of terms in Equation 3. When 
$\sigma^2(\gamma) = c_1 E(S_1^2) - c_2 E(S_2^2)$, the authors utilized numerical integration 
and the corollary due to Fleiss (1971) to determine the accuracy of the confidence 
coefficients. When $k > 2$ the authors conducted a simulation study based upon 10,000 
replications for all possible signed combinations of the expected mean squares with 
variable degrees of freedom across the different signed combinations. Put in the context 
of the variance components encountered in generalizability theory, their results cover all 
variance components encountered in one- and two-facet designs as well as most encountered 
in three-facet designs. Results (Burdick & Graybill, 1992, Table 3.3.1, p. 42) are superior 
across the three types of confidence intervals. Particularly impressive were the results for 
the lower and two-sided intervals, which either held at their stated $\alpha$-level or were 
conservative. As with the Graybill-Wang interval, the upper one-sided interval, under 
certain conditions, proved too liberal. 

Methodology and results 

Building on the results of Ting et al. (1989) and Ting et al. (1990), this paper further 
examines their confidence interval estimation methods and tests their appropriateness for 
$\sigma^2(\Delta)$ and $\sigma^2(\delta)$ in all possible one- and two-facet designs, and 
select three-facet designs. Consider the following inequalities implied from Equations 4 and 5: 

$$\hat\sigma^2(\gamma) - \sqrt{V_L} \;\le\; \sigma^2(\gamma) \;\le\; \hat\sigma^2(\gamma) + \sqrt{V_U}. \tag{6}$$
Table 1: Simulated Ranges of 95% Confidence Coefficients of Lower and Upper Intervals and 90% 
Two-Sided Intervals on $\sigma^2(\Delta)$ and $\sigma^2(\delta)$. 

Error                            Design
Variance            Interval     p x i       i : p       p x i x h   p x (i : h)  (i : p) x h
$\sigma^2(\Delta)$  [L,∞)        .942-.951   .941-.953   .935-.959   .938-.950    .937-.955
                    [0,U]        .948-.952   .949-.953   .946-.955   .947-.954    .948-.953
                    [L,U]        .901-.905   .900-.905   .896-.905   .899-.903    .900-.906
$\sigma^2(\delta)$  [L,∞)        .944-.950   .941-.951   .943-.954   .944-.952    .939-.952
                    [0,U]        .947-.952   .949-.953   .946-.960   .947-.957    .951-.958
                    [L,U]        .902-.908   .899-.908   .898-.908   .897-.902    .902-.905

Error                            Design
Variance            Interval     i : (p x h) (i x h) : p i : h : p   p x i x h x o  (p x i x h) : o
$\sigma^2(\Delta)$  [L,∞)        .944-.954   .945-.950   .939-.947   .928-.960      .935-.956
                    [0,U]        .946-.957   .950-.959   .946-.954   .945-.964      .948-.962
                    [L,U]        .899-.909   .904-.915   .897-.902   .889-.928      .900-.916
$\sigma^2(\delta)$  [L,∞)        .944-.960   .935-.957   .943-.959   .924-.972      .934-.964
                    [0,U]        .947-.961   .945-.959   .948-.958   .943-.977      .941-.967
                    [L,U]        .896-.918   .894-.908   .897-.905   .890-.942      .893-.925


Dividing all terms by $\sum_{i=1}^{k} c_i E(S_i^2)$ and utilizing the fact that 
$n_i S_i^2 / E(S_i^2)$, $i = 1, \ldots, k$, are independently distributed chi-square random 
variables for balanced, random, normal probability models, the inequality becomes a function 
of $c_i E(S_i^2) / \sum_{i=1}^{k} c_i E(S_i^2)$. Notice that if all the $c_i$s are equal, as 
is the case with variance components, then the inequality reduces to a function of 
$E(S_i^2) / \sum_{i=1}^{k} E(S_i^2)$. This is an explicit assumption in Ting et al. (1989) 
and an implicit assumption in Ting et al. (1990). Though the assumption doesn’t impinge on 
the confidence interval tests associated with variance components, it does affect tests 
involving $\sigma^2(\Delta)$ and $\sigma^2(\delta)$, since the $c_i$s are not all equal in 
those cases. 
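
As a worked illustration of why the coefficients differ for an error variance, consider the $p \times i$ design with $n'_i$ items in the D study. The sketch below assumes the standard expression $\sigma^2(\Delta) = [\sigma^2(i) + \sigma^2(pi)]/n'_i$ for this design and substitutes $\sigma^2(i) = [E(S_i^2) - E(S_{pi}^2)]/n_p$ and $\sigma^2(pi) = E(S_{pi}^2)$; it is offered as an example and is not taken from the simulation study itself.

```latex
\begin{aligned}
\sigma^2(\Delta)
  &= \frac{1}{n'_i}\left[\frac{E(S_i^2) - E(S_{pi}^2)}{n_p} + E(S_{pi}^2)\right] \\
  &= \underbrace{\frac{1}{n_p\, n'_i}}_{c_1} E(S_i^2)
   + \underbrace{\frac{n_p - 1}{n_p\, n'_i}}_{c_2} E(S_{pi}^2).
\end{aligned}
```

Both coefficients are positive, so Expression 2 applies, but $c_2 = (n_p - 1)\,c_1$, so the equal-coefficient simplification is unavailable whenever $n_p > 2$.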

The study examined several combinations of $n_i$, $i = 1, \ldots, k$, which yield different 
$c_i$, and $\rho_i = c_i E(S_i^2) / \sum_{i=1}^{k} c_i E(S_i^2) > 0$. Because not all the 
$c_i$s are equal, the simulation study placed particular emphasis on non-standard values of 
$\rho_i$. Depending upon whether the linear combination of expected mean squares for 
$\sigma^2(\Delta)$ and $\sigma^2(\delta)$ contained all positive or both negative and 
positive signs, a SAS program produced 10,000 values associated with Expressions 2 and 6 
using the random number generator for the gamma distribution, RANGAM, with shape parameter 
$\nu/2$, where $\nu$ denotes the degrees of freedom, and scale parameter 2. These were used 
to determine the number of lower, upper, and two-sided intervals containing 
$\sum_{q=1}^{j} \rho_q - \sum_{r=j+1}^{k} \rho_r$. Based on the simulation study involving 
10,000 replications and the normal approximation to the binomial distribution, the chance 
that the simulated values differ from the actual values in magnitude by more than 0.004 is 
less than 5 percent. Table 1 provides results of the simulation study. The range of values 
for each error variance and design represents the variation across different $\rho_i$ and 
$n_i$ combinations. 
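
The two ingredients of this simulation, drawing mean squares from a gamma distribution and gauging Monte Carlo error with the normal approximation to the binomial, can be sketched as follows. This is not the SAS program used in the study: the function names, the seed, and the expected mean square are assumptions made for illustration, and Python's `random.gammavariate(shape, scale)` stands in for RANGAM.

```python
import math
import random


def simulate_mean_squares(expected_ms, df, n_reps, rng):
    """Draw n_reps mean squares S^2 = E(S^2) * chi-square_df / df."""
    # A chi-square variate with df degrees of freedom is a gamma variate
    # with shape df/2 and scale 2.
    return [expected_ms * rng.gammavariate(df / 2.0, 2.0) / df
            for _ in range(n_reps)]


def coverage_margin(confidence, n_reps, z=1.96):
    """Half-width of the 95% normal-approximation band around a
    simulated confidence coefficient based on n_reps replications."""
    return z * math.sqrt(confidence * (1.0 - confidence) / n_reps)


rng = random.Random(12345)  # fixed seed so the run is repeatable
draws = simulate_mean_squares(expected_ms=2.0, df=10, n_reps=10_000, rng=rng)
print(sum(draws) / len(draws))                   # should be near 2.0
print(round(coverage_margin(0.95, 10_000), 4))   # 0.0043
print(round(coverage_margin(0.90, 10_000), 4))   # 0.0059
```

Under this approximation the band is roughly 0.004 wide on each side for the one-sided 95 percent intervals and somewhat wider, close to 0.006, for the two-sided 90 percent intervals, which is worth keeping in mind when reading the ranges in Table 1.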
The results parallel those found by Ting et al. (1989) and Ting et al. (1990). Overall, 
the lower one-sided and two-sided confidence intervals either held at the designated 
$\alpha$-level or were conservative. The upper one-sided intervals did not perform as well, 
sometimes yielding liberal confidence coefficients. Given these results, both for 
error variances and variance components, the intervals provided by Inequalities 2 and 6 are 
strong candidates for use with generalizability theory applications. Indeed, because the 
lower one-sided intervals performed well and because upper bounds for variance components 
and error variances provide more crucial information than do lower bounds, the lower 
one-sided intervals provided by Inequalities 2 and 6 are particularly appropriate 
for generalizability theory applications. The following section provides an application of 
these confidence intervals to a situation involving cut-scores. 

An application 

In a number of circumstances, the numerical score received by a student on a test 
has a set of standards applied to it. In some instances, these standards provide a 
cut-score, above which the students pass and below which the students don’t. Classification 
errors within this pass-or-fail scenario are directly proportional to the standard error of 
the measurement instrument, $\hat\sigma^2(\Delta)$. Clearly, as sampling error increases the 
variability of $\hat\sigma^2(\Delta)$, the potential for classification errors increases. 
The following is an investigation of the extent to which misclassification increases using 
the confidence intervals tested above. 

Data collected from 100 students responding to five different essay prompts were used 
to test misclassification rates with respect to the variability of error variances. Three raters 
assessed each prompt from each student and issued scores ranging from one to nine. Using 
mean squares available from the GENOVA output, confidence intervals about 
$\hat\sigma^2(\Delta)$ were derived using the same number of raters and tasks originally 
provided in the G study, $n'_t = 5$ and $n'_r = 3$, as well as with $n'_t = 10$ and 
$n'_r = 3$. A Mathematica notebook designed for computing said intervals performed the 
necessary computations. Borrowing an efficient graphical depiction from Cronbach et al. 
(1995), results are presented in Figure 1. 

The gullwings in Figures 1(a) and 1(b) are truncated normal ogives reflected about 
the cut-score of seven. The outer gullwing in each case represents the upper bound on the 
lower 95 percent confidence interval about $\sigma^2(\Delta)$, whose point estimate is 
represented by the inner gullwing. Lower confidence intervals provide the most relevant 
information in almost all generalizability applications for the simple reason that knowing 
how small $\hat\sigma^2(\Delta)$ could be isn’t nearly as important as knowing how large it 
could be. In Figure 1(a), with $n'_t = 5$ and $n'_r = 3$, $\hat\sigma^2(\Delta) = .293$, 
whereas the outer gullwing was constructed from a normal distribution with standard 
deviation equal to .674. The main reason for such a large upper bound for the lower 95 
percent confidence interval was a large mean square associated with the task effect. By 
doubling the number of tasks and leaving the number of raters fixed at three, the standard 
deviation associated with the outer gullwing is reduced to .544, as shown in Figure 1(b). 
The relationship between the number of D study sampled conditions and the upper and 
lower bounds for the confidence intervals is highly nonlinear and difficult to predict. 

In either case, the results are troubling. Suppose, for example, that this test determined 
who passes a writing course and who fails. Across repeated administrations of the test 
with respect to the intended universe of generalization, one should expect highly variable 
misclassification rates. Misclassification rates based upon $\hat\sigma^2(\Delta) = .3$ 
might be considered tenable under certain circumstances: if a score of 6.5 represents the 
highest failing score, then one should expect less than five percent of those students with 
a passing score of seven to fail. Yet misclassification rates based upon 
$\hat\sigma^2(\Delta) = .6$ are almost always indefensible: nearly 20 percent of those 
students with a passing score of seven fail. What matters here is recognizing the amount of 
variability possible across administrations and its impact upon misclassification. 

Figure 1. Absolute standard error with respect to the 95 percent lower confidence interval in two 
D studies. 
(a) Depiction of $0 \le \sigma^2(\Delta) \le U$ where $\hat\sigma^2(\Delta) = .293$ and $U = .674$. 
(b) Depiction of $0 \le \sigma^2(\Delta) \le U$ where $\hat\sigma^2(\Delta) = .293$ and $U = .544$. 
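
The two misclassification figures quoted above can be checked with a short calculation. The sketch assumes a normal error model centered on the student's score of seven and treats the quoted values .3 and .6 as the standard deviation of that error distribution, which is how the gullwing construction describes them; the function name is hypothetical.

```python
from statistics import NormalDist


def failing_rate(true_score, highest_failing, error_sd):
    """P(observed score <= highest_failing) under a normal error model."""
    return NormalDist(mu=true_score, sigma=error_sd).cdf(highest_failing)


for sd in (0.3, 0.6):
    # Roughly 0.048 for sd = 0.3 and 0.202 for sd = 0.6.
    print(sd, round(failing_rate(7.0, 6.5, sd), 3))
```

Doubling the standard deviation roughly quadruples the chance that a student at the cut-score is observed at or below 6.5, which is the sense in which the upper bound matters more than the point estimate.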



Conclusions 

Sampling error in variance components and error variances used in generalizability 
theory is often overlooked when making determinations about the accuracy of a given test. 
This paper attempts to address this shortcoming by presenting a number of results with 
respect to confidence intervals about linear combinations of expected mean squares 
appropriate for generalizability theory. Simulation results indicate that these intervals, 
particularly the two-sided and one-sided lower intervals, are accurate or conservative both 
in simple and complex designs with varying amounts of difference in degrees of freedom 
across effects. In practice the lower intervals might prove to be most relevant, since 
knowing how large an error variance could be matters far more than knowing how small it 
could be. The potential for grossly underestimating standard errors seems a real 
possibility. More analysis using larger data sets is needed. 